The aim here is to build a full list of proteins to test any data source against for coverage. There are three sources for the genes:
1. The bait list from the pulldown data.
2. The prey list from the pulldown data.
3. The DIP database.
The first two should be easy, but the third I'm going to have to figure out. Start by loading in the bait and prey lists:
In [1]:
cd /home/gavin/Documents/MRes/forGAVIN/pulldown_data/
In [3]:
#loading in bait list
import csv
f = open("BAITS/baits_entrez_ids.csv")
c = csv.reader(f)
baitids = []
for l in c:
    #to avoid empty rows
    if l:
        baitids.append(l[0])
In [21]:
#loading in prey list
f = open("PREYS/prey_entrez_ids.csv")
c = csv.reader(f)
preyids = []
for l in c:
    #to avoid empty rows
    if l:
        preyids.append(l[0])
preyids = preyids[:-2]
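The two loading cells are identical apart from the filename; a small shared loader (just a sketch, not in the original) would avoid the duplication:

import csv

def load_ids(path):
    # read one Entrez ID per row, skipping empty rows
    with open(path) as f:
        return [row[0] for row in csv.reader(f) if row]

baitids = load_ids("BAITS/baits_entrez_ids.csv")
preyids = load_ids("PREYS/prey_entrez_ids.csv")[:-2]  # drop the two trailing rows, as above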
The DIP database is used by Qi as a reliable source of true protein interactions. The easy route is to use the list provided by the Qi paper, then just remove duplicates and convert to Entrez.
Another approach would be to retrieve the full mouse and human datasets from the actual DIP website. They appear to be provided in the form of FASTA files, and then I suppose it would be possible to find the associated Entrez-ID. I'm not sure, but that might have to be done with BLAST.
So basically, the question is how to retrieve an Entrez-ID from a sequence. Let's see if I can do this with Biopython:
In [ ]:
from Bio import Entrez
from Bio import SeqIO
Entrez.email = "gavingray1729@gmail.com"
In [16]:
testhandle = Entrez.efetch(db="nucleotide", id=baitids[0], rettype="gb", retmode="text")
testrecord = SeqIO.read(testhandle, 'genbank')
So now we have a sequence that we want to use to get back the original Entrez-ID number:
In [18]:
testrecord.seq
Out[18]:
Cancel that, talked to Colin.
Turns out the FASTA records have uniprot identifiers in them so we can just batch convert the uniprot IDs to Entrez-IDs using the tools on the uniprot web site.
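For reference, pulling the uniprot IDs out of the FASTA headers could look something like this (a sketch, assuming headers of the form >dip:DIP-1N|refseq:NP_xxxxxx|uniprot:P12345):

# a sketch, assuming pipe-separated FASTA headers as above
from Bio import SeqIO

uniprot_from_fasta = []
for record in SeqIO.parse("fasta20131201.seq", "fasta"):
    for field in record.id.split("|"):
        if field.startswith("uniprot:"):
            uniprot_from_fasta.append(field.split(":")[1])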
So the important thing now is to get a list of all the human proteins that have interaction data in the DIP database. Looking around the website, this page contains files with all the proteins in the DIP database.
It's not split by species, so I can either use that file and split it by species myself, or take the human interaction file and remove the duplicates myself. Looking at both files:
In [23]:
cd /home/gavin/Documents/MRes/DIP/
In [24]:
ls
In [30]:
%%bash
wc -l Hsapi20140427.txt
wc -l fasta20131201.seq
head -n 2 Hsapi20140427.txt
Well, the Hsapi20140427.txt file isn't very nice to look at.
It appears some of the lines contain proteins with uniprot identifiers and some only have the DIP's own reference number.
It would be useful to know if the full list also has lines with no refseq or uniprot identifiers.
Can do this with grep (instructions here):
In [36]:
%%bash
grep "dip" fasta20131201.seq | grep -v "uniprot" | grep -v "refseq" > norefeqoruniprot
head norefeqoruniprot
wc -l norefeqoruniprot
Looks like there's 296 in there then. Better make sure we can convert this DIP identifier to Entrez. At the moment don't know of a way to do this.
Well, giving up on that file now, because even if I can solve that problem I'll still have to come up with a way to find the human proteins in there.
So I should focus on parsing the Hsapi20140427.txt file for the proteins.
Could just disregard lines which can't be mapped to Entrez-IDs?
In [61]:
c = csv.reader(open("Hsapi20140427.txt"), delimiter="\t")
In [62]:
#skip first line
c.next()
#take out the first two strings for every line
IDstrings = map(lambda x: (x[0],x[1]), c )
The below uniqueifier is from this stackoverflow post.
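The cells below also rely on a flatten function that isn't defined in this section; a minimal one-level version (an assumption, the original may have imported one from elsewhere) would be:

# assumed helper (not in the original notebook): flatten one level of nesting
flatten = lambda nested: [item for sub in nested for item in sub]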
In [72]:
from collections import OrderedDict
# would quite like to flatten this list
# then remove duplicate entries
IDstrings = list(OrderedDict.fromkeys(flatten(IDstrings)))
print "Number of proteins in DIP human dataset is %i"%len(IDstrings)
In [73]:
print "Example entries:"
print IDstrings[0:10]
Next I have to remove entries that only have DIP identifiers, because I can't work with those. Or I could just split all the strings by the | character into the different identifier types and push all three files through the uniprot web system.
In [115]:
splitstrings = map(lambda x: x.split("|"), IDstrings)
# then have to sort into three lists
# get ready for some super readable code
# who needs programming efficiency?
DIPIDs = []
uniprotIDs = []
refseqIDs = []
for line in splitstrings:
    #can only look at the strings themselves to sort
    for s in line:
        #test for each case:
        if "DIP" in s:
            DIPIDs.append(s)
        elif "uniprot" in s:
            uniprotIDs.append(s)
        elif "refseq" in s:
            refseqIDs.append(s)
In [95]:
csv.writer(open("DIPIDs.txt", "w"),delimiter="\n").writerow(DIPIDs)
csv.writer(open("uniprotIDs.txt", "w"),delimiter="\n").writerow(uniprotIDs)
csv.writer(open("refseqIDs.txt", "w"),delimiter="\n").writerow(refseqIDs)
Using the uniprot website to map these to Entrez. Starting with the DIP IDs, I got two files out: a mapping table and a target list.
Interestingly, though, these two files are completely different lengths, which could mean that a number of the DIP IDs map to the same uniprot ID. Checking if this is the case:
In [131]:
#load in DIP mapping table
c = csv.reader(open("DIPmappingtable.tab"), delimiter="\t")
#ignore first row
c.next()
# convert to lists
DIPmap = list(c)
# unpack and zip to get uniprot
u = zip(*DIPmap)[1]
# how long is it to start with
print "Before removing duplicates length of uniprot is %i"%len(u)
# then remove duplicates
u = list(OrderedDict.fromkeys(u))
# how long is it now?
print "After removing duplicates length of uniprot is %i"%len(u)
OK, so what the heck is the target list in that case?
Turns out, looking at that list file, I had actually downloaded an xml file instead of what I wanted. After downloading the right file it matches the above after removing duplicates, so the target list is what I initially thought it was.
Naively continuing: adding this list of uniprot identifiers to the list of uniprot identifiers we already have and saving this file:
In [116]:
#remove the uniprot preceding in uniprotIDs:
uniprotIDs = map(lambda x: x.split(":")[1],uniprotIDs)
In [109]:
csv.writer(open("uniprotplusDIPderived.txt", "w"),delimiter="\n").writerow(uniprotIDs+u)
Running this through the uniprot website to convert to Entrez-IDs threw a warning:
Only IDs from the file are being mapped.
I didn't really know what to make of this, so I tried the uniprotIDs taken straight from the DIP file. That throws the same warning by itself and fails to match any Entrez-IDs. Is there some syntax problem?
In [119]:
print uniprotIDs[0]
print u[0]
It appears some of these even have the same name, so it seems like it shouldn't be a syntax problem. I should probably remove duplicates before sending this to uniprot. Some of the entries also have extra numbers after dashes, which I could try slicing off as well.
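It might also be worth ruling out invisible characters first (a quick sketch: check for stray whitespace and list any non-alphanumeric characters in the IDs):

# stray whitespace or odd characters would explain a parsing problem
print any(s != s.strip() for s in uniprotIDs + u)
print set(ch for s in uniprotIDs + u for ch in s if not ch.isalnum())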
In [136]:
#remove duplicates
totaluniprot = list(OrderedDict.fromkeys(u+uniprotIDs))
#slice off anything extra
totaluniprot = map(lambda x: x.split("-")[0], totaluniprot)
In [137]:
print "Length before removing duplicates was %i and after is now %i."%(len(u+uniprotIDs),len(totaluniprot))
In [138]:
#rewrite the file
csv.writer(open("uniprotplusDIPderived.txt", "w"),delimiter="\n").writerow(totaluniprot)
Putting this into uniprot still gives the same warning, and I'm still unsure why. Ignoring the warning for now.
Put the file into uniprot converter and 3,592 of 3,763 were mapped. Saved the mapping table.
Now I just have to convert the refseq list to Entrez as well, then combine the lists and remove the duplicates. Then map back to the sequences in the FASTA file, and also use the above code to do the same for the mouse database (as we might get better coverage on that for features).
Also, will need to map back to the interaction table to know which of these actually interacts. This can be done with the DIP identifiers, which I can now map back to with these tables.
In [128]:
#first have to cutoff the "refseq:" in the refseqIDs file and rewrite
refseqIDs = map(lambda x: x.split(":")[1],refseqIDs)
csv.writer(open("refseqIDs.txt", "w"),delimiter="\n").writerow(refseqIDs)
Putting this into uniprot mapped 2,135 of 2,189. Using this mapping table to add to the uniprot list as before:
In [130]:
#load in refseq mapping table
c = csv.reader(open("refsequniprotmap.tab"), delimiter="\t")
#ignore first row
c.next()
# convert to lists
refseqmap = list(c)
# unpack and zip to get uniprot
ur = zip(*refseqmap)[1]
# how long is it to start with
print "Before removing duplicates length of uniprot is %i"%len(u)
# then remove duplicates
ur = list(OrderedDict.fromkeys(u))
# how long is it now?
print "After removing duplicates length of uniprot is %i"%len(u)
Wait a second, how is the mapping table longer than the number of identifiers mapped? Is this just the general mapping table for all proteins or something?
Nope, looks like the refseq IDs map onto multiple uniprot IDs...
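A quick way to confirm the one-to-many mapping (a sketch using the table loaded above):

# count how many refseq IDs (first column of the mapping table)
# appear more than once, i.e. map to several uniprot IDs
from collections import Counter
refseq_counts = Counter(zip(*refseqmap)[0])
print sum(1 for n in refseq_counts.values() if n > 1)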
Combining everything together into a real total uniprot list:
In [139]:
#remove duplicates
totaluniprot = list(OrderedDict.fromkeys(u+uniprotIDs+ur))
#slice off anything extra
totaluniprot = map(lambda x: x.split("-")[0], totaluniprot)
In [140]:
print "Length before removing duplicates was %i and after is now %i."%(len(u+uniprotIDs+ur),len(totaluniprot))
In [141]:
#write the file
csv.writer(open("total.uniprot.txt", "w"),delimiter="\n").writerow(totaluniprot)
Finally, putting this file into the uniprot converter to get Entrez-IDs:
4,261 out of 4,445 identifiers mapped to 3,760 identifiers in the target data set
And looking at the final list of Entrez-IDs:
In [142]:
%%bash
head total.Entrez.txt
To reiterate, I would like to:
1. Use the above code to do the same for the mouse database.
2. Map the Entrez-IDs back to the DIP interaction tables.
3. Map back to the sequences in the FASTA file.
Starting with number 1, going to first reorganise this directory a bit:
In [144]:
%%bash
mkdir human
mv * human/
mv human/fasta20131201.seq .
In [146]:
%%bash
mkdir mouse
In [181]:
# took the above code and combined it into a script
import DIPscripts
In [177]:
cd mouse
In [164]:
ls
In [186]:
uniprotIDs = DIPscripts.DIPgetproteinIDs("Mmusc20140427.txt")
In [180]:
ls
Then I just have to retrieve the mapping tables by hand as before. Interestingly, it looks like the protein interaction data for human is actually larger than that for mouse. For the DIP identifiers:
1,689 out of 1,873 identifiers mapped to 1,680 identifiers in the target data set
For the refseq identifiers:
1,065 out of 1,079 identifiers mapped to 1,515 identifiers in the target data set
Then I just need another script to use the code above to combine the three files into one total uniprot file, and then I can put that into the converter and get another list of Entrez-IDs for mouse.
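Something along these lines (a sketch based on the cells above; the actual DIPscripts.gettotaluniprot may differ):

import csv
from collections import OrderedDict

def gettotaluniprot(dipmapfile, refseqmapfile, uniprotIDs, outfile="total.uniprot.txt"):
    # combine the uniprot columns of the two mapping tables with the
    # raw uniprot ID list, strip isoform suffixes, dedupe and write
    total = []
    for fname in (dipmapfile, refseqmapfile):
        c = csv.reader(open(fname), delimiter="\t")
        c.next()  # skip the header row
        total += [row[1] for row in c]
    total += uniprotIDs
    # slice off anything after a dash and remove duplicates
    total = list(OrderedDict.fromkeys(s.split("-")[0] for s in total))
    csv.writer(open(outfile, "w"), delimiter="\n").writerow(total)
    return total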
In [216]:
cd ..
In [217]:
reload(DIPscripts)
Out[217]:
In [218]:
cd mouse/
In [219]:
ls
In [220]:
totaluniprotIDs = DIPscripts.gettotaluniprot("DIPtouniprot.tab","refseqtouniprot.tab",uniprotIDs)
Then plugging the full uniprot file into the uniprot converter I get:
1,686 out of 1,815 identifiers mapped to 1,702 identifiers in the target data set
Downloaded the target list and having a look at it here:
In [221]:
%%bash
head total.Entrez.txt
Now:
1. Map back to the interaction tables.
2. Map back to the sequences in the FASTA file.
Moving onto the first of these: mapping back to the interaction tables. This is actually pretty similar to the previous mapping and I think should be pretty easy with dictionaries. Using the mapping tables to build a converter from Entrez ID to DIP ID.
Had errors writing this part because I forgot to change directory before starting and mixed up the human and mouse variables:
In [334]:
cd /home/gavin/Documents/MRes/DIP/human
Had to also generate the uniprottoEntrez.tab file:
4,261 out of 4,445 identifiers mapped to 3,760 identifiers in the target data set
Also renaming DIPmappingtable.tab:
In [319]:
%%bash
mv DIPmappingtable.tab DIPtouniprot.tab
In [335]:
# will need two dictionaries to do this
# one for Entrez - uniprot
c = csv.reader(open("uniprottoEntrez.tab"), delimiter="\t")
#skip first line
c.next()
# extract to lists
cl = list(c)
# initialise dictionary
uniEntrez = {}
# fill with empty lists
for l in cl:
    uniEntrez[l[1]] = []
#then fill these lists
for l in cl:
    uniEntrez[l[1]].append(l[0])
# and one for uniprot - DIP
c = csv.reader(open("DIPtouniprot.tab"), delimiter="\t")
c.next()
cl = list(c)
DIPuni = {}
for l in cl:
    DIPuni[l[1]] = []
for l in cl:
    DIPuni[l[1]].append(l[0])

def EntreztoDIP(sEntrez,uniEntrez,DIPuni):
    """take an Entrez ID and map it back to a DIP ID using the two dictionaries:
    uniEntrez - Entrez to uniprot dictionary
    DIPuni - uniprot to DIP dictionary"""
    uniprot = uniEntrez[sEntrez]
    #have to make sure one of the uniprot IDs matches
    for i in uniprot:
        try:
            DIP = DIPuni[i]
        except KeyError:
            pass
    return DIP
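As an aside, the two-pass fill pattern above could be collapsed into one pass with a defaultdict (a sketch, not what the notebook uses):

# one-pass version of the dictionary building above
from collections import defaultdict
d = defaultdict(list)
for l in cl:
    d[l[1]].append(l[0])
# convert back to a plain dict so missing keys still raise KeyError,
# which the EntreztoDIP function above relies on
DIPuni = dict(d)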
Then just load in the Entrez-IDs and feed those into this function:
In [321]:
c = csv.reader(open("total.Entrez.txt"), delimiter="\n")
EntrezIDs = list(flatten(list(c)))
In [322]:
EntrezDIPdict = {}
for i in EntrezIDs:
    EntrezDIPdict[i] = EntreztoDIP(i,uniEntrez,DIPuni)
Ah, this fails because some of the IDs were lost when converting from DIP to uniprot (i.e. some are only known via the uniprot converter and some only from the original table...). I'll have to use the original table at this point to get this mapping.
In [323]:
for l in splitstrings:
    if len(l) > 1:
        if "uniprot" in l[-1]:
            DIPuni[l[-1].split(":")[-1]] = []
for l in splitstrings:
    if len(l) > 1:
        if "uniprot" in l[-1]:
            DIPuni[l[-1].split(":")[-1]].append(l[0])
And, as I feared, this is also true of the refseq values. So we need a uniprot to refseq dictionary and then we can build a refseq to DIP dictionary that we can use in the try statement above.
In [325]:
%%bash
mv refsequniprotmap.tab refseqtouniprot.tab
In [348]:
# uniprot - refseq
c = csv.reader(open("refseqtouniprot.tab"), delimiter="\t")
c.next()
cl = list(c)
unirefseq = {}
for l in cl:
    unirefseq[l[1]] = []
for l in cl:
    unirefseq[l[1]].append(l[0])
for l in splitstrings:
    if len(l) > 2:
        if "uniprot" in l[-1]:
            DIPuni[l[-1].split(":")[-1]] = []
for l in splitstrings:
    if len(l) > 2:
        if "uniprot" in l[-1]:
            DIPuni[l[-1].split(":")[-1]].append(l[1])
In [349]:
# refseq - DIP
refseqDIP = {}
for l in splitstrings:
    if len(l) > 1:
        if "refseq" in l[1]:
            refseqDIP[l[1].split(":")[-1]] = []
for l in splitstrings:
    if len(l) > 1:
        if "refseq" in l[1]:
            refseqDIP[l[1].split(":")[-1]].append(l[0])
In [343]:
def EntreztoDIP(sEntrez,uniEntrez,DIPuni,refseqDIP,unirefseq):
    """take an Entrez ID and map it back to a DIP ID using the dictionaries:
    uniEntrez - Entrez to uniprot dictionary
    DIPuni - uniprot to DIP dictionary
    unirefseq - uniprot to refseq dictionary
    refseqDIP - refseq to DIP dictionary"""
    uniprot = uniEntrez[sEntrez]
    #have to make sure one of the uniprot IDs matches
    DIP = []
    for i in uniprot:
        try:
            DIP += DIPuni[i]
        except KeyError:
            try:
                refseq = unirefseq[i]
                for j in refseq:
                    try:
                        DIP += refseqDIP[j]
                    except KeyError:
                        pass
            except KeyError:
                pass
    if DIP == []:
        raise ValueError('Failed to retrieve DIP ID for Entrez-ID %s'%sEntrez)
    return DIP
In [344]:
EntrezDIPdict = {}
for i in EntrezIDs:
    EntrezDIPdict[i] = EntreztoDIP(i,uniEntrez,DIPuni,refseqDIP,unirefseq)
So finally it appears to run, but there's still something going wrong with it:
In [339]:
print EntrezDIPdict['12015']
Why's it returning a refseq value? Tracing.
Refreshed the above initialisation of the DIPuni dictionary and now it works. Annoying, because now I don't really know how the error slipped in (one candidate: the In [348] cell above appends l[1], which on three-field lines is the refseq string, straight into DIPuni)...
In [353]:
uniprot = uniEntrez['100009298']
print uniprot
#DIP = DIPuni[uniprot[0]]
#print DIP
refseq = unirefseq[uniprot[0]]
print refseq
Looking at EntrezDIPdict some Entrez-IDs still don't map to DIP IDs. Don't really understand how this can happen.
Could only happen if there are lines in splitstrings that only contain refseq...
Or it could be because I'm still in the mouse directory and I'm using splitstrings, which was for human...
Changing directory above.
In [354]:
for l in splitstrings:
    if "DIP" not in l[0]:
        print l
In [5]:
from Bio import Entrez
Entrez.email = "gavingray1729@gmail.com"
In [294]:
print baitids[3]
In [11]:
efetch_handles = map(lambda i: Entrez.efetch(db="nucleotide", id=i, rettype="gb", retmode="text"),baitids)
In [9]:
from Bio import SeqIO
In [15]:
ncbi_record = SeqIO.read(efetch_handles[3], 'genbank')
print ncbi_record
As Colin pointed out, there shouldn't be a bull gene in there, so something's gone wrong. Is the above gene even in the list?
In [298]:
baitids.index('63682')
Looks like it's not...
So I must be using this Biopython function wrong.
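My best guess at what's going wrong (an assumption, not verified here): these are Entrez Gene IDs, which live in NCBI's gene database, while db="nucleotide" interprets the same number as a nucleotide GI and so fetches a completely unrelated record. Something like this should query the right database:

# a sketch: query the gene database, where Entrez Gene IDs are valid,
# instead of nucleotide, where the number gets treated as a GI
from Bio import Entrez
Entrez.email = "gavingray1729@gmail.com"
handle = Entrez.efetch(db="gene", id=baitids[3], rettype="gene_table", retmode="text")
print handle.read()[:500]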